I found three government workforce data sets from different public websites and downloaded them in *.csv format. You can view these data sets for yourself in my GitHub repo. Unfortunately, these are not relational tables, so joins do not make much sense; however, they do contain similar data types and columns, so unions will be useful. I really enjoy using RStudio and R Markdown, and I will be using dplyr from the tidyverse to do most of my analytics, but where possible I will also provide the equivalent SQL code.
I tried to keep a consistent naming convention across all of the data sets: lower case, with underscores instead of spaces. Each data set contains a department variable and some sort of salary or wage value, which will make for a good comparison across the cities. The Marin data contains the most descriptive columns, allowing for more interesting analysis. Louisville is the least like the others: it reports values summarized by department rather than by individual employee. Let’s look at the departments:
In two of the data sets, employees occasionally have duplicate rows. I would imagine this could happen due to changes in role, salary, or other status changes. The SQL to find these individuals would be:
```sql
-- Sample SQL to check for duplicate names
select names, count(*) as num_rows
from bloomington
group by names
having count(*) > 1
```

(Note: referencing the alias `num_rows` in `having` works in MySQL but is nonstandard; `having count(*) > 1` is portable.)
| 78 |
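The dplyr version of this check is a sketch along these lines, assuming the data has been loaded into a data frame named `bloomington`:

```r
library(dplyr)

# Count rows per name and keep only names that appear more than once
bloomington %>%
  group_by(names) %>%
  summarise(num_rows = n()) %>%
  filter(num_rows > 1)
```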
It would make sense to take the maximum salary for each person instead of keeping the duplicate rows.
```sql
-- Sample SQL to remove duplicate names
select names, departments, max(salaries) as salaries
from bloomington
group by names, departments
```
| 0 |
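The same de-duplication can be sketched in dplyr (same assumed data frame and column names as above):

```r
library(dplyr)

# Keep the maximum salary for each name/department pair
bloomington_dedup <- bloomington %>%
  group_by(names, departments) %>%
  summarise(salaries = max(salaries), .groups = "drop")
```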
A similar check was run for Chicago, and there were no duplicates.
| 85 |
Finally, Marin is a multi-year data set, so I checked for duplicates within each year and determined the last date in the date range.
| 0 |
| max_year |
|---|
| 2020 |
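A dplyr sketch of both checks, assuming the Marin data frame is named `marin` with `year` and `names` columns (the actual column names may differ):

```r
library(dplyr)

# Any names duplicated within a single year?
marin %>%
  group_by(year, names) %>%
  filter(n() > 1)

# Last year in the date range
marin %>%
  summarise(max_year = max(year))
```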
The data sets are all slightly different, so each will have to be summarized in its own way.
Bloomington is a spare data set, consisting only of names, departments, and salaries.
Looking at the summary table, there are clearly some part-time positions that skew the data to the left. Many of these appear to be administrative positions that could be flagged manually by department; to save time, I am going to flag them using the 25% quantile instead.
## 25%
## 24198.6
Looking at the graph above, the 25% quantile is just under $25,000, so we’ll flag part-time at that cutoff and then rerun our summary tables.
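A minimal dplyr sketch of the flagging step, assuming the salary column is named `salaries`:

```r
library(dplyr)

# Compute the 25% quantile and flag everything at or below it as part-time
cutoff <- quantile(bloomington$salaries, 0.25, na.rm = TRUE)

bloomington <- bloomington %>%
  mutate(part_time = salaries <= cutoff)
```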
One of the first things that is noticeable about Chicago’s data is that it contains both salaried and hourly workers. I am going to multiply the hours worked by the hourly rate (times 52 weeks) and put the result into the salary column so that everything is in one place.
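A sketch of that conversion, with hypothetical column names (`annual_salary`, `hourly_rate`, `typical_hours`) standing in for whatever the Chicago file actually uses:

```r
library(dplyr)

# Annualize hourly workers: hours per week * hourly rate * 52 weeks
chicago <- chicago %>%
  mutate(annual_salary = if_else(
    is.na(annual_salary),
    hourly_rate * typical_hours * 52,
    annual_salary
  ))
```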
Marin has no salary numbers, and it spans multiple years of data. Let’s examine only one year so the counts are 1:1 with the other data sets. Looking at the year 2020, we’ll examine department by gender.
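The 2020 cut and the department-by-gender counts could be done along these lines (the column names `year`, `department`, and `gender` are assumptions):

```r
library(dplyr)

# Restrict to one year, then count heads by department and gender
marin %>%
  filter(year == 2020) %>%
  group_by(department, gender) %>%
  summarise(n = n(), .groups = "drop")
```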
Here, we looked at 95% prediction interval ranges for these salaries by department to spot individual salaries outside the normal range. You can hover over a point to see the name of the person it corresponds to.
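One simple way to approximate such a range is mean ± 1.96 standard deviations within each department (this assumes roughly normal pay within a department; the column name `wage` is hypothetical):

```r
library(dplyr)

# Approximate 95% range of individual pay within each department
marin %>%
  filter(year == 2020) %>%
  group_by(department) %>%
  summarise(
    lower = mean(wage, na.rm = TRUE) - 1.96 * sd(wage, na.rm = TRUE),
    upper = mean(wage, na.rm = TRUE) + 1.96 * sd(wage, na.rm = TRUE)
  )
```

Points falling outside their department's band are the ones worth hovering over.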